_______________________________________________________
                       M V S P
_______________________________________________________
A MultiVariate Statistics Package for
the IBM PC and Compatibles
(C) Copyright Warren L. Kovach, 1986
Department of Biology
Indiana University
Bloomington, IN 47405
Ver. 1.3, Feb., 1986
This program is being distributed as user-supported
software. If you find this program to be of value,
a voluntary contribution ($25 suggested) would be appreciated.
MVSP Ver. 1.3 -- User's Manual Page 2
CONTENTS
--------
Introduction
Acknowledgements
Disclaimer
General Use of Program
  Main Menu Options
    A-E: Statistical Procedures
    F: Change Drive or Sub-directory
    G: Change Program Defaults
    H: HELP!
    Q: Quit MVSP
  Data Files
    Data File Header
    Data Labels
    Data File Titles
    Data Matrix
  Running Statistical Procedures
    Principal Components Analysis
    Reciprocal Averaging
    Dissimilarity and Similarities
    Cluster Analysis
    Diversity Indices
Future Plans
8087 Support
The User Supported Concept
Appendix: Test Data Files
References
INTRODUCTION
MVSP is a package of common multivariate statistical
procedures widely used in many areas of biology and geology, as
well as other fields. These procedures include principal
components analysis, reciprocal averaging, distance or
dissimilarity measures, average-linkage cluster analysis, and
diversity indices. These procedures are geared towards quick,
simple analyses of small to medium sized data sets. Any heavy
number crunching would be best suited for mainframe computers or
some of the more sophisticated microcomputer statistical packages
which are available. However, the price and simplicity of use of
MVSP are hard to beat!
I've tried to make this program as easy to use as possible.
One possible drawback to ease of use is that some users may be
very tempted to take a "black box" approach to using these
statistics, feeding in numbers and coming up with "The Answer".
I must strongly warn the users of this program that statistics
can be DANGEROUS!
All these procedures make assumptions about the data and have
restrictions on what they can and cannot do. If these assumptions
and restrictions are violated, the results could be meaningless.
I urge you to become familiar with the methods and their
assumptions before you use this program. This manual contains a
list of references which I have found very useful in
understanding these techniques. In particular, Sneath & Sokal
(1973), Gauch (1982), and Pielou (1984) are very well written and
give very clear discussions of these techniques.
ACKNOWLEDGEMENTS
This program is written in Turbo Pascal, and compiled using
the version 3.0 compiler. The procedures for producing the pop-
up menus and the disk directory listings are modified from Philip
R. Burns' public domain procedures PIBMENUS and PIBDIR, both of
which are incorporated into his PIBTERM program. These procedures
are widely available on many electronic bulletin board services
across the country. Check with your local users groups for more
information, if you haven't already been bitten by the BBS bug.
The assembly language procedure for direct memory video
output is from Steve Hall's contribution to "PC-Magazine's" Power
User column (Oct. 1, 1985). The eigenanalysis algorithm used in
the principal components analysis and reciprocal averaging
procedures is translated and modified from Orloci's (1978) BASIC
programs. The scattergram procedure in the PCA and RA procedures
is translated and modified from Cooke, Craven, and Clarke's book
"Basic Statistical Computing", a very nice book with BASIC
programs for doing numerous types of statistical analyses. The
sort procedure used in the Spearman coefficient procedure is
taken from Jim Savold's ZIPSORT procedure (ver. 1.1).
DISCLAIMER
The accuracy of this program has of course been extensively
tested against the results of other programs, but the results are
not guaranteed. You may wish to initially also run comparisons
with the results of other programs, using your own data set, to
ensure that it is working properly with your type of data. We
all know about those demons which manage to get into computer
programs, causing foul-ups when we least suspect it!
Note when running comparisons that there are often many
methods of computing the same thing, and results may vary,
especially in the more complex principal components and
reciprocal averaging procedures. In principal components
analysis, for instance, there are numerous ways of transforming
the data before eigenanalysis, and the component loadings can be
scaled either to unity (as they are here) or to the variance of
that principal component. These differences may have great
effects on the results, and should be kept in mind.
If you do run into any problems with this program, whether
they be in the results or abnormalities in the running of the
program, please contact me at the address given on the title
page, or through PC-LINK CENTRAL in Bloomington (812-824-7990),
and give details of the problem and, if possible, the data set
which you were running when the bug cropped up.
Please note that no warranty is given for this program. The
author (Warren L. Kovach) shall not be legally liable for any
damages or lost profits arising from use or misuse of this
program.
GENERAL USE OF PROGRAM
MVSP is a simple-to-use, menu-driven program which
presents you with the possible options at each step. The program
is initiated by typing the name of the program, MVSP, at the DOS
prompt. Note that there are two files which are necessary for
this program, MVSP.COM and MVSP.000, and these must both be on
the default drive when the program is started. If you have
changed any of the program defaults, the configuration file named
MVSP.CNF (which is created when you save your changes) must also
be on the default drive.
When the program is loaded, you will see an introductory
screen giving the name and address of the author, then you will
be presented with a menu of available procedures. The first
option on the menu will be highlighted by a rectangular cursor.
This cursor can be moved up and down the list of options by using
the up and down arrow keys on the numeric keypad of the keyboard.
A choice of option is made by hitting the carriage return when
the correct option is highlighted, or alternatively by typing the
letter preceding the desired option.
MAIN MENU OPTIONS
=================
OPTIONS A-E:
The first five options are for the basic statistical
procedures: PRINCIPAL COMPONENTS ANALYSIS, RECIPROCAL AVERAGING,
SIMILARITIES AND DISSIMILARITIES, CLUSTER ANALYSIS, and DIVERSITY
INDICES. These procedures are described later in this document.
OPTION F:
This option, CHANGE DRIVE OR SUB-DIRECTORY, allows you to
temporarily change the drive and sub-directory on which the input
and output data files will be found by default. If you enter a
path name without a drive specification, the default drive is
assumed. If you enter just a drive specification (e.g. "A" or
"A:" or "A:\") the default path will be the root directory of
that drive. A "?" lists the sub-directories in the currently
logged directory. A carriage return with no other input exits
this option with no changes.
OPTION G:
The CHANGE PROGRAM DEFAULTS option allows you to change the
initial default colors, path name, and data file extensions.
These default specifications can be saved to the file MVSP.CNF,
which will be reloaded each time the program is run, reinstating
these defaults. When you choose this option you will be
presented with a menu asking which type of default should be
changed.
DEFAULT COLORS allows you to change the color of the regular
text and background, the menu text and background, and the menu
frame. Choosing one of these will cause a menu of available
colors to appear. You can experiment with color combinations
easily, quitting the color menu when you are satisfied. Note
that option "F" on the menu resets black and white colors, which
are the defaults if the MVSP.CNF configuration file is not found.
This option can be useful in case you get yourself into a color
combination that is so unreadable that you can't see the options
available!
DEFAULT DATA FILE PATH changes the default path used for data
files, just like option F above. However, this option allows you
to save this specification for future use, while option F is for
temporary changes. If you are using a two floppy disk system, it
is often most useful to have the program files in drive A:, and
to have the default data file path set to B:, so that data files
are on another disk. If you have a hard disk, you could have the
program files in a subdirectory named C:\MVSP (which would be the
default directory when you invoke the program) and the data
either on a floppy in drive A: or B:, or in a hard disk directory
named C:\MVSP\DATA. You would then specify the default data file
path through this option. You can even set up separate
directories for different types of data, which is where the
temporary path change option (option F) would come in handy. You
can always override the default path option by either changing it
through options F or G, or by specifying the drive and path when
you are asked for the name of the data file when running one of
the statistical procedures.
DEFAULT DATA FILE EXTENSIONS allows you to change the default
extensions for your input and output files. I personally prefer
*.DAT for input files and *.OUT for output files (these are the
internal defaults used if MVSP.CNF is not found), but you can
easily change this and save your changes.
The cluster analysis program can have different defaults,
which facilitates the input of similarity or dissimilarity
coefficients from this program to the cluster procedure. The
coefficients program can output a symmetrical matrix to a file in
the form required by the cluster procedure. The filename
extension of this file will default to the extension which you
specify for cluster analysis input (*.DIS is the internal
default). Thus, to perform a cluster analysis of the file
DATA.DAT, you need only to enter the name DATA in both the
similarity procedure and the cluster procedure. The similarity
matrix will be calculated for DATA.DAT, placed in DATA.DIS, and
read from DATA.DIS by the cluster procedure. The output file for
the cluster program can also have its own default extension
(*.CLS is the internal default).
Entering a blank carriage return for the output file
extensions will direct output to the default printer (Lst)
instead of a file. Entering "NUL" will nullify any hard copy
output, and you will only see the results printed to the screen.
MINIMUM EIGENVALUE allows you to control the number of
components which are printed out in the PCA and RA procedures by
changing the value for the minimum eigenvalue. More on this in
the section on PCA.
REREAD CONFIGURATION FILE will reread the MVSP.CNF
configuration file which contains the user default settings.
This will reinstate the default settings which are normally
active when the program is initiated. This can be handy if you
have made a lot of changes to defaults during a session (without
saving them!) and you wish to return to your old defaults.
SAVE DEFAULTS TO FILE MVSP.CNF will save any changes in the
defaults to a configuration file, which will be reloaded every
time the program is run. If this file is not found in the
same directory as the other MVSP program files, the internal
defaults will be set. If any changes are made to the defaults,
and you attempt to exit the configuration menu without saving
them, you will be reminded that these new defaults have not been
saved and given the option to return and save these options, or
continue back to the main menu.
HELP! will provide abbreviated descriptions of the options of
the configuration menu.
QUIT CONFIGURE will return you to the main menu.
OPTION H:
HELP! will provide descriptions of the main menu options as
well as information about the expected format of the data files
and the author's name and address.
OPTION Q:
QUIT MVSP will exit the MVSP program and return to the DOS
prompt.
DATA FILES
==========
The input data files should be ASCII text files which can be
created with the DOS line editor EDLIN, or many other word
processors, such as PC-WRITE or XYWRITE. Some word processors,
such as WORDSTAR, modify some characters to special formatting
characters ("high bits"). MVSP cannot read these modified
characters. You can check whether your word
processor is one of these by listing a word processed file with
the DOS TYPE command and looking for strange characters. If your
word processor uses these extra characters, make sure you create
your data file in a non-document mode which creates normal ASCII
files.
You may also maintain your data with spreadsheet or database
programs, such as LOTUS 123. Most of these have an option for
printing data to ASCII files, which can then be modified to the
appropriate format for MVSP (mainly by adding the file header
information, discussed below). This can greatly expedite data
management and manipulation, making it easier to select species
or sites to be analyzed.
DATA FILE HEADER:
The first line of the data file should be a header line,
which will give the program some information about the data, such
as the number of rows and columns. It should look something like
this:
* 10 15
This header line should begin with an asterisk ("*") in the first
column of the first line of the file. This asterisk tells the
program that a header is present. If the asterisk is not found,
the program assumes that the header information is not present,
and it will prompt the user for the information. The two numbers
are the number of rows and columns in the data matrix. The above
example has 10 rows and 15 columns. MAKE SURE that if this
header information is present, there is an asterisk before it; if
not, the header information will be read as data!
You may also include data labels in the data file. These
labels will be printed on your output to help make sense of the
masses of numbers which will be spewed out. If labels are
included, this must be specified in the file header. For
example:
*L 10 15
specifies a data file which includes data labels and which has 10
rows and 15 columns (NOT including the labels themselves). The
"L" must come immediately after the "*", with no intervening
spaces, or it will be read as the number of rows, and an error
will occur. The numbers of rows and columns must be separated by
at least one space from each other.
DATA LABELS:
The column and row labels themselves can be up to 8
characters long and may consist of any printable character,
except spaces. The following are all valid labels:
ROW1
COLUMN_2
1st-Loc.
#3-Site
This label is NOT valid:
SITE 1
It will be read as two labels, "SITE" and "1".
The column labels should be in the second row of the data
file, after the header line, and the labels should be separated
by at least one space. The labels may be continued onto
subsequent lines; the program will continue reading column labels
until it has read as many as the number of columns you have
specified in the header line.
Row labels occur on the same line as the data row to which
they apply, and should precede the first datum in that row, with
a space separating the label and datum.
DATA FILE TITLES:
A title may also be added to your data file on the header
line, so that you know what this data represents. Here's an
example:
*L 10 15 Test data file for MVSP
This title, "Test data file for MVSP", will be listed to the
screen and placed on the output when that file is selected. It
must be separated from the other elements of the header by at
least one space, and it cannot be more than 70 characters long.
The dissimilarities procedure will also place this title in the
header of the matrix output file, along with the specification of
which coefficient was used, so that the title is carried over to
the clustering program.
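The header rules above (asterisk in the first column, optional "L"
immediately after it, row and column counts, optional title of at
most 70 characters) can be sketched in a few lines of Python. This
is only an illustration of the format as described in this manual,
not the parser MVSP itself uses; the names Header and parse_header
are hypothetical.

```python
from typing import NamedTuple, Optional


class Header(NamedTuple):
    has_labels: bool
    rows: int
    cols: int
    title: str


def parse_header(line: str) -> Optional[Header]:
    """Parse an MVSP-style header line; return None if no header is present."""
    if not line.startswith("*"):
        return None  # no asterisk in column 1: the caller must ask for the sizes
    has_labels = line[1:2] == "L"  # "L" must follow "*" with no space
    rest = line[2:] if has_labels else line[1:]
    parts = rest.split(None, 2)  # rows, columns, then the optional title
    rows, cols = int(parts[0]), int(parts[1])
    title = parts[2].strip() if len(parts) > 2 else ""
    if len(title) > 70:
        raise ValueError("title is longer than 70 characters")
    return Header(has_labels, rows, cols, title)


print(parse_header("*L 10 15 Test data file for MVSP"))
```

A header written as "* L 10 15" fails in int(), which mirrors the
error the manual describes when a space separates "*" and "L".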
DATA MATRIX:
The data matrix itself should consist of the data points
separated by at least one space. The data for one row can be
continued on the next line. If the number of rows or columns you
specify is wrong, the data matrix will be read incorrectly, often
without warning. If you have a 10x10 matrix and specify 9
columns by mistake, the last datum on the first row will be read
as the first datum of the second row, and so on. This, needless
to say, can raise havoc with your results! BE CAREFUL! All
procedures can print out the raw data so that you can check to
make sure it was read correctly.
Here is an example data file:
*L 5 10 Test data set for MVSP
     COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10
ROW1   23    2    4   53    6   45    2    3   67     5
ROW2   10    2    4   34    1    4    3   10   20     3
ROW3    2   34    0    1   35   12    1   90   10     9
ROW4   98   12   10    4   10    9   10    5   20    31
ROW5    1    7    9   11   75    7    5   21    0    10
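As a rough sketch of how such a file can be read, the Python below
treats everything after the header as one stream of
whitespace-separated tokens, which is why a wrong row or column
count shifts every later value. It handles only labeled files, and
read_data_file is a hypothetical name, not part of MVSP.

```python
def read_data_file(text: str):
    """Return (title, column_labels, row_labels, matrix) from a labeled file."""
    header, _, body = text.partition("\n")
    fields = header.split(None, 3)  # "*L", rows, columns, optional title
    if fields[0] != "*L":
        raise ValueError("this sketch only handles labeled files (*L header)")
    rows, cols = int(fields[1]), int(fields[2])
    title = fields[3].strip() if len(fields) > 3 else ""
    tokens = iter(body.split())  # line breaks carry no meaning after the header
    col_labels = [next(tokens) for _ in range(cols)]
    row_labels, matrix = [], []
    for _ in range(rows):
        row_labels.append(next(tokens))  # the label precedes the first datum
        matrix.append([float(next(tokens)) for _ in range(cols)])
    return title, col_labels, row_labels, matrix


SAMPLE = """\
*L 5 10 Test data set for MVSP
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10
ROW1 23 2 4 53 6 45 2 3 67 5
ROW2 10 2 4 34 1 4 3 10 20 3
ROW3 2 34 0 1 35 12 1 90 10 9
ROW4 98 12 10 4 10 9 10 5 20 31
ROW5 1 7 9 11 75 7 5 21 0 10
"""

title, col_labels, row_labels, matrix = read_data_file(SAMPLE)
print(title, row_labels, matrix[0])
```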
The input data files for the cluster analysis program use a
slightly different header format. Here is an example:
*L 15 DIS Test data set for MVSP
Since the clustering program uses a symmetrical matrix as input,
it only needs one number for the size of the data matrix. In
this case the size of the matrix is 15x15. The third element of
the header is a three letter phrase specifying whether the matrix
is a similarity (SIM) or dissimilarity (DIS) matrix. This code
MUST be separated from the number of objects by only one space,
or it will not be read correctly. The dissimilarity and
similarity procedure of this program automatically sets up its
output files in this manner for input into the clustering
procedure.
Here is an example of a clustering input file, generated from
an analysis of the above matrix, using the Spearman Rank Order
Correlation Coefficient:
*L 10 SIM Test data set for MVSP - SPEARMAN
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10
1.00
-0.15 1.00
0.36 -0.05 1.00
0.20 -0.97 0.05 1.00
-0.60 0.67 0.15 -0.60 1.00
0.30 0.21 -0.31 -0.00 0.10 1.00
0.30 -0.05 0.97 0.00 0.10 -0.50 1.00
-0.80 0.62 -0.41 -0.70 0.60 -0.30 -0.30
1.00
0.82 -0.55 -0.03 0.62 -0.82 0.41 -0.10
-0.87 1.00
0.10 0.67 0.67 -0.60 0.70 0.10 0.60
0.10 -0.41 1.00
Note that this file is a lower half matrix, with diagonals (the
1.00's) included. Other forms of matrices may also be specified
for input to the clustering program, as discussed below, but this
is the default output form of the similarities and
dissimilarities procedure.
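To make the lower-half layout concrete, here is a small Python
sketch that expands such a file into a full symmetrical matrix. It
assumes a labeled file; read_lower_half is a hypothetical name, and
the sample is the example above cut down to its first three objects
for brevity. Unlike the program, this sketch does not insist on
exactly one space before SIM or DIS.

```python
def read_lower_half(text: str):
    """Return (kind, labels, full_matrix) from a labeled lower-half matrix file."""
    header, _, body = text.partition("\n")
    fields = header.split(None, 3)  # "*L", size, SIM or DIS, optional title
    if fields[0] != "*L":
        raise ValueError("this sketch only handles labeled files (*L header)")
    n, kind = int(fields[1]), fields[2]
    if kind not in ("SIM", "DIS"):
        raise ValueError("expected SIM or DIS after the matrix size")
    tokens = iter(body.split())
    labels = [next(tokens) for _ in range(n)]
    full = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # row i holds i + 1 values, diagonal included
            full[i][j] = full[j][i] = float(next(tokens))
    return kind, labels, full


# First three objects of the example matrix above (truncated for illustration).
SAMPLE_MATRIX = """\
*L 3 SIM Test data set for MVSP - SPEARMAN
COL1 COL2 COL3
1.00
-0.15 1.00
0.36 -0.05 1.00
"""

kind, labels, full = read_lower_half(SAMPLE_MATRIX)
print(kind, labels, full)
```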
RUNNING STATISTICAL PROCEDURES
==============================
When one of the statistical procedure options (A-E) is
chosen, you will first be asked for the name of the input data
file. You may obtain a directory of the default data disk and
path by typing a "?". You may then specify a certain file mask
(such as *.DAT for all files with a .DAT extension) or simply hit
the carriage return for all files. You may then enter the name
of the data file. The program will automatically add your
specified default extension if no extension is specified. So, if
your datafile is named "STUDY1.DAT" and your default extension is
*.DAT, you need only type "STUDY1". If you specify another
extension, or have a filename with no extension, the program will
recognize those as long as the full name is specified. A blank
carriage return here will return you to the main menu.
If you have elected (through the configuration menu) to have
output sent to the printer, then you will be prompted to make
sure that your printer is ready, and you will then go into the
statistical procedure you have selected.
If you have instead specified a default output file
extension, you will next be prompted for the name of the output
file. If you enter a blank carriage return, this output file
will default to the input file name plus the default output file
extension you have specified. The output file for an analysis of
STUDY1.DAT will default to STUDY1.OUT if your default output
extension is *.OUT. If you have chosen to run the
dissimilarities procedure, you will also be asked if you wish to
have the results input into the clustering procedure. If so,
another filename must be specified to contain just the distance
matrix, with none of the ancillary information. This filename
defaults to the default extension for the cluster analysis input
files.
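The file-naming defaults described above can be illustrated with a
short Python sketch. The function names are hypothetical, and
ntpath (Python's DOS/Windows path module) is used so that DOS-style
paths are split the same way on any system; how MVSP itself does
this is not documented here.

```python
import ntpath


def with_default_ext(name: str, default_ext: str) -> str:
    """Append default_ext (e.g. ".DAT") unless name already has an extension."""
    root, ext = ntpath.splitext(name)
    return name if ext else root + default_ext


def default_output_name(input_name: str, output_ext: str) -> str:
    """Input file name with its extension replaced by the output extension."""
    return ntpath.splitext(input_name)[0] + output_ext


print(with_default_ext("STUDY1", ".DAT"))
print(default_output_name("STUDY1.DAT", ".OUT"))
```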
After the book-keeping business is taken care of, you will
then enter the actual procedure which you have chosen. These
will be discussed separately.
PRINCIPAL COMPONENTS ANALYSIS:
This procedure performs a simple R-mode principal components
analysis. The component loadings are scaled to unity, so that
the sum of squares of an eigenvector equals 1, and the component
scores are scaled so that the sum of squares equals the
eigenvalue. Q-mode PCA will have the opposite scaling. Note
that many packages, such as SPSS and SYSTAT, perform Q-mode PCA,
and thus their eigenvectors will be scaled to the eigenvalue,
rather than unity. Note also that the data matrices for MVSP
must be transposed for use with packages such as SPSS or SYSTAT
to obtain the same eigenvalues.
For details on the computation and assumptions of the PCA
technique, see Orloci (1978), Gauch (1982), and Pielou (1984).
Orloci gives a detailed mathematical discussion of the particular
algorithm used here, while Gauch and Pielou give very clear and
understandable discussions of the basis of the technique and its
use and assumptions.
The size of data matrix which can be analyzed is limited to
55x55 (45x45 for the 8087 version). In the R-mode analysis,
similarity coefficients are calculated for the descriptors, which
are the rows of the matrix (species in an ecological study,
characters in a numerical taxonomic study) and component scores
are calculated for the objects, which are the columns of the
matrix (samples or operational taxonomic units (OTU's)).
You will first be asked if you wish to have the raw data and
the similarity matrix printed out. In analyses of large data
sets, the printing of the data and similarity matrix can add a
little bit of time to the analysis, as well as a hefty pile of
paper. I find it useful to see this output, however,
particularly to check to see if the data was read correctly.
Next, you will be asked if you want the data to be log
transformed. PCA assumes a normal distribution of the data, but
this assumption is often not met. Log transforming the data can
reduce the skewness of the data, resulting in a more
interpretable analysis (Spicer & Hill, 1979). In my research
with fossil plant data, I've found this to be invaluable, as I
always have some samples with extremely high abundances of
certain taxa, and these taxa tend to dominate the analysis due to
their large numbers. Log transforming the data evens this out.
You are given the option of what base of logarithm to use.
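A minimal Python sketch of a log transformation with a selectable
base follows. The offset added before taking the logarithm (so that
zero abundances do not fail) is an assumption of this sketch; this
manual does not say how MVSP treats zeros. The name log_transform is
hypothetical.

```python
import math


def log_transform(matrix, base=10.0, offset=1.0):
    """Return log_base(x + offset) for every entry of a list-of-lists matrix.

    The offset is an assumption of this sketch, not documented MVSP behavior.
    """
    return [[math.log(x + offset, base) for x in row] for row in matrix]


print(log_transform([[0, 9, 99], [1, 7, 9]]))
```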
When the procedure is run, you will have the option of using
either a covariance or correlation matrix, and of using either a
centered or uncentered data matrix. Generally a centered
covariance matrix is used, but if different units of measurement
are used in the data matrix, these will need to be standardized,
and thus a correlation matrix should be used. Standardization
may also be desired to reduce the effects of dominant species, so
that rarer species play a greater role in the resulting
configuration. An uncentered data matrix is called for when
there is appreciable between-axes heterogeneity. This means that
different clusters of points are associated with different axes,
and have little projection on other axes. This often occurs when
different groups of samples have completely different sets of
common species, with little overlap. See Pielou (1984) for more
on this.
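The difference between these choices can be sketched in Python: the
matrix is computed among the rows (the descriptors), optionally
after centering each row on its mean, and optionally standardized
to correlations. The divisor n - 1 and the name similarity_matrix
are assumptions of this sketch; the exact scaling MVSP uses is not
stated in this manual.

```python
import math


def similarity_matrix(data, centered=True, standardized=False):
    """Covariance-type (or correlation-type) matrix among the rows of data."""
    n = len(data[0])  # number of objects (columns)
    rows = []
    for row in data:
        mean = sum(row) / n if centered else 0.0
        rows.append([x - mean for x in row])

    def cross(a, b):
        # Divisor n - 1 is an assumption of this sketch.
        return sum(x * y for x, y in zip(a, b)) / (n - 1)

    s = [[cross(a, b) for b in rows] for a in rows]
    if standardized:
        sd = [math.sqrt(s[i][i]) for i in range(len(s))]
        s = [[s[i][j] / (sd[i] * sd[j]) for j in range(len(s))]
             for i in range(len(s))]
    return s


data = [[1, 2, 3], [2, 4, 6]]
print(similarity_matrix(data))
print(similarity_matrix(data, standardized=True))
```

Standardizing removes the effect of the larger units in the second
row: the covariances differ by a factor of four, while every
correlation is 1.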
Status messages will be listed to the screen during the
analysis to let you know how things are proceeding. The final
results will also be listed out while they are being saved to the
output file or sent to the printer. The eigenvalues and their
percentage of the total variation will be printed along with the
component coefficients (or eigenvectors), then the component
scores for each principal component will be printed.
You may choose the minimum eigenvalue for which principal
components are printed out. The internal default is to print
components only if the eigenvalue is greater than the average
eigenvalue. This is often considered a good rule of thumb for
determining whether a component is interpretable (Legendre &
Legendre, 1983). You may change this default through the program
defaults option (G) on the main menu. A value of 0 will cause
all components to be printed out, and any other value, such as 1,
may also be entered as a minimum eigenvalue. This minimum value
may be saved in the MVSP.CNF configuration file along with the
colors and default datafile paths and extensions.
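The selection rule can be sketched as follows in Python. The
function name is hypothetical, and the treatment of an eigenvalue
exactly equal to the cutoff is a guess: strictly greater than the
average by default, at least the minimum when one is given, so that
a minimum of 0 prints every component.

```python
def components_to_print(eigenvalues, minimum=None):
    """Indices of the components that pass the eigenvalue cutoff."""
    if minimum is None:
        mean = sum(eigenvalues) / len(eigenvalues)
        return [i for i, value in enumerate(eigenvalues) if value > mean]
    return [i for i, value in enumerate(eigenvalues) if value >= minimum]


eigenvalues = [4.0, 2.5, 1.0, 0.5]  # made-up values for illustration
print(components_to_print(eigenvalues))
print(components_to_print(eigenvalues, minimum=0))
```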
You may also have the component loadings and component scores
plotted on a scatter diagram. You will be asked how many axes
you wish to have plotted. If you choose three, for instance, the
first three axes will be plotted against each other in every
combination of two dimensional plots (3 plots in this case, 6 for
four axes, etc.). Entering a zero will bypass the plotting
procedure.
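The number of plots is simply the number of ways of choosing two
axes from those requested, as this short Python sketch shows
(axis_pairs is a hypothetical name).

```python
from itertools import combinations


def axis_pairs(n_axes):
    """All pairs of axes (numbered from 1) to be plotted against each other."""
    return list(combinations(range(1, n_axes + 1), 2))


print(axis_pairs(3))
print(len(axis_pairs(4)))
```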
After the component plots, the raw data will be printed out
sorted by the first component scores and factors. This can be
useful for allowing you to see patterns and trends in the raw
data alone. If the first component accounts for a large
proportion of the variance, and if there is an interpretable
gradient along the first axis, then this pattern can be striking.
RECIPROCAL AVERAGING:
The reciprocal averaging procedure performs an eigenanalysis
form of reciprocal averaging. Again, see Orloci (1978), Gauch
(1982) and Pielou (1984) for details on this procedure. The
setup and usage of this procedure are similar to the PCA
procedure, with some differences. This procedure uses more
computer memory, with the result that the largest matrix which
can be analyzed is 45x45 (40x40 for the 8087 version). There are
also a few more options available.
The analysis can be run with either a weighting of the rare
species or the common species. See Orloci (pp. 152-168) for
details of these methods of weighting. Also, the scores can be
adjusted to percentages, to approximate the results of the
original RA algorithm as put forth by Hill (1973). The data file
should have species as the rows and samples as the columns, as in
the PCA procedure.
DISSIMILARITY AND SIMILARITIES:
This program calculates a variety of dissimilarity and
similarity measures. There are presently six measures available.
These measures and their formulas are:
  Euclidean distance:
      EDjk = SQRT [ SUMi SQR (Xij - Xik) ]
  Cosine theta (or normalized Euclidean) distance:
      CDjk = SQRT [ SUMi SQR (Xij / Yj - Xik / Yk) ]
      where Yj = SQRT [ SUMi SQR (Xij) ]
  Manhattan metric distance:
      MMDjk = SUMi [ ABS (Xij - Xik) ]
  Canberra metric distance:
      CMDjk = SUMi [ ( ABS (Xij - Xik) ) / (Xij + Xik) ]
  Spearman rank order correlation coefficient:
      SCCij = 1 - [ ( 6 * SUMk SQR (Rik - Rjk) )
                    / (CUBE (N) - N) ]
      where R = rank of variable; N = number of ranked values
  Pearson product moment correlation coefficient:
      PCCij = [ SUMk (Xik - MEAN (Xi) ) * (Xjk - MEAN (Xj) ) ]
              / [ SQRT ( SUMk SQR (Xik - MEAN (Xi) ) )
                * SQRT ( SUMk SQR (Xjk - MEAN (Xj) ) ) ]
  (X = data value; ABS = absolute value; SQR = square;
   SQRT = square root; MEAN = mean; CUBE = cubed;
   SUM = summation)
See Sneath & Sokal (1973), Pielou (1984), and Prentice (1980) for
discussions and derivations of these measures. The maximum size
of data matrix allowed is 95x95 (85x85 for the 8087 version).
The distances are calculated between the columns of the data
matrix. An option to transpose the data matrix before the
analysis is included, to allow analysis of the rows without
requiring reentry of the data.
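For readers who want to check a value by hand, the six measures can
be written out in Python roughly as follows, with a and b standing
for two columns of the data matrix. This is an illustrative sketch,
not the program's code. Two details are assumptions: tied values
receive the average of their ranks, and Canberra terms with a zero
denominator are skipped. Results for data with ties may therefore
differ from MVSP's output.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_theta(a, b):
    ya = math.sqrt(sum(x * x for x in a))
    yb = math.sqrt(sum(y * y for y in b))
    return math.sqrt(sum((x / ya - y / yb) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    # Terms where x + y is zero are skipped to avoid 0/0 (an assumption).
    return sum(abs(x - y) / (x + y) for x, y in zip(a, b) if x + y != 0)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(a, b):
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n ** 3 - n)


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (math.sqrt(sum((x - ma) ** 2 for x in a))
           * math.sqrt(sum((y - mb) ** 2 for y in b)))
    return num / den


print(euclidean([0, 0], [3, 4]), manhattan([1, 2, 3], [3, 2, 1]))
```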
This procedure is set up to allow easy input of the distance
measures into the clustering analysis procedure. If you choose
to input the distance matrix into the clustering program, a copy
of the distance matrix along with the appropriate header
information will be put into a separate file from the full
output. This matrix file can then be used as input to the
clustering program.
CLUSTER ANALYSIS:
This procedure performs average linkage cluster analysis on
an input matrix of some sort of distance or similarity measure.
Four forms of average linkage clustering are presently available:
unweighted pair group, unweighted centroid, weighted pair group,
and weighted centroid (or median). For clear and concise
explanations of the theory and practice behind cluster analysis,
see Sneath and Sokal (1973) and Pielou (1984). The largest data
matrix this program can handle is 95x95 (85x85 for the 8087
version).
A number of different input formats are available, including
various forms of half matrices and full matrices (a lower half
matrix with a diagonal, the output form of the dissimilarity
procedure, is the default). You must also specify whether the
input measure is a similarity or dissimilarity measure (if it
isn't specified in your data file header).
The output of the procedure consists of a report of the
status of the clustering procedure as each new object is added to
the cluster. The average similarity or dissimilarity of the two
groups which have just been joined is printed out, along with a
listing of the two groups and the number of objects in the newly
fused group. If a single object is added to another cluster, the
label for that object (or a numerical label corresponding to its
position in the data matrix) is printed out. If a whole group is
added, the node at which that group was last added to is printed
out. For instance, a report such as:
NODE   GROUP 1   GROUP 2
  1    COL1      COL2
  2    COL4      COL5
  3    NODE 1    COL3
  4    NODE 3    NODE 2
would correspond to a dendrogram of the form:
COL1  COL2  COL3  COL4  COL5
 |     |     |     |     |
 -------     |     -------
    |        |        |
    ----------        |
        |             |
        ---------------
               |
The actual lengths of the branches of this dendrogram would
depend on the average similarity at which each pair of groups is
fused. The dendrogram can be reconstructed by hand or plotted
with a computer graphics program. Joseph
Felsenstein's cladistic package PHYLIP contains a program written
by Christopher Meacham for drawing cladograms and dendrograms.
See Felsenstein (1985) for details on the availability of this
free package.
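For reconstructing the tree from the node report described above,
a small script can do the bookkeeping. This is only a sketch: it
assumes the report has already been typed in as (node, group 1,
group 2) rows, that no object label begins with "NODE ", and it
writes the tree as nested parentheses without branch lengths. The
function name is made up for this example.

```python
def report_to_tree(report):
    """Turn (node, group1, group2) rows into a nested-parenthesis string."""
    nodes = {}

    def resolve(name):
        # "NODE n" refers back to an earlier row of the report;
        # anything else is taken to be an object label.
        if name.startswith("NODE "):
            return nodes[int(name.split()[1])]
        return name

    tree = None
    for node, group1, group2 in report:
        tree = "(" + resolve(group1) + "," + resolve(group2) + ")"
        nodes[node] = tree
    return tree


report = [
    (1, "COL1", "COL2"),
    (2, "COL4", "COL5"),
    (3, "NODE 1", "COL3"),
    (4, "NODE 3", "NODE 2"),
]
print(report_to_tree(report))
# (((COL1,COL2),COL3),(COL4,COL5))
```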
DIVERSITY INDICES:
This procedure computes three of the diversity indices most
commonly used in ecology: Simpson's, Shannon's, and
Brillouin's. See Pielou (1969) for a discussion of the use and
derivation of these indices.
The input data file should be set up with species as rows and
samples as columns. The diversity, then, is calculated for each
column. The largest data matrix which can be processed is 95x95
(85x85 for the 8087 version). Be forewarned that the Brillouin
index calculates factorials of the species abundances, and if any
of your abundances are high, this could take a VERY LONG TIME!
Data matrices with numerous species abundances on the order of
hundreds or thousands could make for a rather long coffee break!
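For readers computing the Brillouin index themselves, the cost of
the factorials can be avoided by working with the logarithm of the
factorial directly. The sketch below is not how MVSP does it; it
assumes integer abundances and natural logarithms (the manual does
not state the log base), and uses the log-gamma function, since
ln(n!) = lgamma(n + 1).

```python
import math


def brillouin(counts):
    """Brillouin index H = (ln N! - sum of ln n_i!) / N for one sample."""
    counts = [n for n in counts if n > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    # lgamma(n + 1) equals ln(n!) and never forms the huge factorial.
    log_n_factorial = math.lgamma(total + 1)
    log_parts = sum(math.lgamma(n + 1) for n in counts)
    return (log_n_factorial - log_parts) / total


# Abundances in the thousands are handled without delay.
print(brillouin([5000, 3000, 2000]))
```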
The output consists not only of the diversity index, but also
the number of species and the evenness, which is defined as the
diversity divided by the log of the number of species (Pielou,
1969).
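A minimal sketch of two of these indices and of the evenness, for
one sample (one column of the data matrix). Several forms of
Simpson's index are in use; the one shown here, 1 minus the sum of
squared proportions, is an assumption, as is the use of natural
logarithms. Evenness follows the definition given above. The
sample values are the first column (S1) of the GAUCH.DAT listing
in the appendix.

```python
import math


def shannon(counts):
    """Shannon index H' = -sum(p * ln p) over the species present."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def simpson(counts):
    """Simpson's index in the form 1 - sum(p^2) (one common variant)."""
    total = sum(counts)
    return 1.0 - sum((n / total) ** 2 for n in counts)


def evenness(diversity, counts):
    """Diversity divided by the log of the number of species present."""
    species = sum(1 for n in counts if n > 0)
    return diversity / math.log(species) if species > 1 else 0.0


s1 = [9, 8, 6, 3, 5, 2, 3, 0, 2, 0, 4, 0, 0, 0]
h = shannon(s1)
print("species present:", sum(1 for n in s1 if n > 0))
print("Shannon:", h)
print("evenness:", evenness(h, s1))
print("Simpson:", simpson(s1))
```

Four equally abundant species give the maximum Shannon value,
ln 4, and therefore an evenness of 1.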
FUTURE PLANS
My plans for future versions of this program include adding
character graphics procedures for the clustering procedure and
adding more coefficients to the dissimilarities and similarities
procedure. I am also considering adding Bray & Curtis polar
ordination, with some of the modifications which have been
suggested by Beals (see his 1984 paper for summaries), as well as
detrended correspondence analysis (see Hill & Gauch, 1980). Any
comments on favorite statistics out there? Let me know what you
would like to see in this program.
I also hope to figure out a way to increase the size of the
data matrices that this program accepts. They are now limited by
the 64K limit that Turbo Pascal imposes for the size of the data
and stack segments. My attempts to use memory outside of that
64K space for the data matrices have met with some very strange
results (including one time when my screen began flashing a
psychedelic pattern of ASCII characters while the computer
proceeded to trash out my data disk; see what I mean by demons?).
If you have any other comments about the procedures in this
program, or about procedures NOT in this program, which you feel
would be useful to include, these should be sent to me at the
address on the title page of this manual. THANK YOU!
8087 SUPPORT
If you aren't satisfied with the speed of this program, a
faster version which uses the 8087 math coprocessor is available.
This coprocessor (an optional chip that can be plugged
into your computer and costs anywhere from $100 to $200) greatly
speeds up floating point arithmetic on real numbers. The
increase in speed can often amount to a factor of 10!
Turbo Pascal, the compiler used for this package, offers a
special compiler which creates programs which take advantage of
this processor. The programs compiled with this special compiler
will only work on machines which have the 8087 installed. They
also will have lower limits on the data matrix size, since the
8087 version of Turbo Pascal uses more memory to store each
number (and hence has a greater accuracy in its computations).
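The arithmetic behind the two limits can be checked roughly as
follows. The byte sizes are assumptions not stated in this manual:
6 bytes per real in standard Turbo Pascal and 8 bytes per real in
the 8087 version. The sketch also assumes that a single square
matrix of reals takes up most of the 64K data segment, with the
remainder left for the program's other variables.

```python
SEGMENT_BYTES = 64 * 1024   # size of one 64K segment
REAL_BYTES = 6              # assumed: standard Turbo Pascal real
REAL_8087_BYTES = 8         # assumed: 8087 version real


def matrix_bytes(n, bytes_per_value):
    """Storage needed for an n x n matrix of reals."""
    return n * n * bytes_per_value


print(matrix_bytes(95, REAL_BYTES))        # 54150, fits in 65536
print(matrix_bytes(85, REAL_8087_BYTES))   # 57800, fits in 65536
print(matrix_bytes(95, REAL_8087_BYTES))   # 72200, too large
```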
A version of this program which has been compiled for the
8087 is available to registered users (those who have made a
voluntary monetary contribution; see below). If you are working
with smaller matrices (maximum matrix sizes are specified in the
procedure descriptions above), then this could speed things up a
good bit. For example, a PCA of a 45x45 data matrix took one
hour with the normal version of the program, but only twenty
minutes with the 8087 version.
THE USER SUPPORTED CONCEPT
This software package is being distributed under the user
supported concept. In case you haven't run across this software
phenomenon, the following is a brief discussion of its tenets.
User supported software is an experiment in "grass-roots"
software distribution and development. Andrew Fluegelman, one of
the pioneers of this phenomenon, expressed it this way:
1) The value and utility of software is best assessed by the
user on his or her own system.
2) The creation of new and useful software should be
supported by the computing community.
3) Copying and sharing of software that you have found useful
should be encouraged, rather than restricted.
User supported programs, such as this, are freely distributed
to the computing community, through the network of electronic
bulletin board services, local computer user groups, word of
mouth, and networks of friends with similar interests. The user
support comes in two forms:
1) The user is encouraged to evaluate the program, suggest to
the author any changes in the program which would be
useful, and recommend the program to others if it is worth
recommending.
2) The user is encouraged to support further programming
efforts (including enhancements of this program) through a
voluntary monetary contribution to the program author.
User supported means that you don't have to pay outrageous
prices for a commercial package without even getting a chance to
test drive it first to see if it really meets your needs. User
supported means that if YOU, the user, decide that this program
is worth supporting, then you support it voluntarily, for a
reasonable cost, and without the hassles of copy-protection and
the high cost of advertising.
You are encouraged to copy and distribute this program. If
you find this program to be useful, a voluntary contribution to
the author ($25 suggested) would be appreciated. This program is
copyrighted, and no price may be charged for this program by any
person other than the author (Warren L. Kovach). A nominal fee
may be charged for distribution costs, such as for the media and
postage and handling, as long as this fee does not exceed $5.
All registered users (users who have made the voluntary
contribution of $25 or more) will be placed on my mailing list,
and they will be notified of new versions and new features of
this program, and will be entitled to upgrades to newer versions
for only the cost of postage and the disk (about $5). They will
also be entitled to versions of the program compiled for the 8087
math coprocessor, also for only the postage and media cost.
Thank you for supporting MVSP!
APPENDIX: Test Data Files
The following are listings of some example data files which
are distributed with MVSP. These data files are taken from the
published literature, and the user may compare the MVSP results
with those of the original analyses.
File JOLIMOSI.DAT:
These data are taken from Jolicoeur & Mosimann (1960), a
pioneering study using PCA in morphometrics. The data are
measurements (in millimeters) of the length, width, and height of
the carapaces of 24 male painted turtles (Chrysemys picta
marginata). The authors interpret the first PC as corresponding
to size increase (growth), while the second and third PCs are
interpreted as shape variation.
*L 3 24 Turtle carapice data from Jolicoeur & Mosimann, 1960, males.
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17 T18
T19 T20 T21 T22 T23 T24
LENGTH 93 94 96 101 102 103 104 106 107 112 113 114 116 117 117 119
120 120 121 125 127 128 131 135
WIDTH 74 78 80 84 85 81 83 83 82 89 88 86 90 90 91 93 89 93 95 93 96
95 95 106
HEIGHT 37 35 35 39 38 37 39 39 38 40 40 40 43 41 41 41 40 44 42 45 45
45 46 47
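To work with a listing like this outside MVSP, it can be read as a
free-format stream of tokens. The sketch below infers the layout
from the listing alone: a header line giving a flag, the number of
rows, the number of columns, and a title; then the column labels;
then each row as a label followed by its values. It assumes labels
are always present and contain no spaces, and it does not
interpret the "*L" flag. The function name is hypothetical.

```python
TURTLES = """\
*L 3 24 Turtle carapice data from Jolicoeur & Mosimann, 1960, males.
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17 T18
T19 T20 T21 T22 T23 T24
LENGTH 93 94 96 101 102 103 104 106 107 112 113 114 116 117 117 119
120 120 121 125 127 128 131 135
WIDTH 74 78 80 84 85 81 83 83 82 89 88 86 90 90 91 93 89 93 95 93 96
95 95 106
HEIGHT 37 35 35 39 38 37 39 39 38 40 40 40 43 41 41 41 40 44 42 45 45
45 46 47
"""


def parse_listing(text):
    """Return (title, column_labels, {row_label: values}) for a listing."""
    lines = text.strip().splitlines()
    header = lines[0].split(None, 3)
    n_rows, n_cols = int(header[1]), int(header[2])
    title = header[3] if len(header) > 3 else ""
    # Everything after the header is treated as one stream of tokens,
    # so line breaks inside a row do not matter.
    tokens = " ".join(lines[1:]).split()
    col_labels, tokens = tokens[:n_cols], tokens[n_cols:]
    rows = {}
    for r in range(n_rows):
        start = r * (n_cols + 1)
        label = tokens[start]
        rows[label] = [float(v) for v in tokens[start + 1:start + 1 + n_cols]]
    return title, col_labels, rows


title, cols, rows = parse_listing(TURTLES)
print(len(cols), sorted(rows))    # 24 ['HEIGHT', 'LENGTH', 'WIDTH']
print(rows["LENGTH"][:3])         # [93.0, 94.0, 96.0]
```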
File GAUCH.DAT:
These data are taken from Gauch (1982). These are composite
samples of upland forest communities from southern Wisconsin,
taken from a pioneer (sample 1) to climax (sample 10) gradient.
He uses these data to demonstrate many different ordination
techniques. He doesn't analyze these data with RA, but he does
use detrended correspondence analysis on these data, with similar
results to MVSP's RA program (particularly on the first axis).
*L 14 10 Wisconsin forest communities data from Gauch, 1982, Table 4.4
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
QUER.MAC 9 8 3 5 6 0 5 0 0 0
QUER.VEL 8 9 8 7 0 0 0 0 0 0
CARY.OVA 6 6 2 7 0 2 0 0 0 0
PRUN.SER 3 5 6 6 6 4 5 0 4 1
QUER.ALB 5 4 9 9 7 7 4 6 0 2
JUGL.NIG 2 0 0 0 3 5 6 4 3 0
QUER.RUB 3 4 0 6 9 8 7 6 4 3
JUGL.CIN 0 0 5 0 2 0 0 2 0 2
ULMU.AME 2 2 4 5 6 0 5 0 2 5
TILI.AME 0 0 0 0 2 7 6 6 7 6
ULMU.RUB 4 0 2 2 5 7 8 8 8 7
CARY.COR 0 0 0 0 0 5 6 4 0 3
OSTR.VIR 0 0 0 0 0 0 7 4 6 5
ACER.SAC 0 0 0 0 0 5 4 8 8 9
REFERENCES
Beals, E.W., 1984. Bray-Curtis Ordination: An Effective Strategy
for Analysis of Multivariate Ecological Data. Adv. in Ecol.
Research, 14:1-55.
Cooke, D., Craven, A.H., & Clarke, G.M., 1982. Basic Statistical
Computing. Edward Arnold (Publishers) Ltd., London.
Felsenstein, J., 1985. Confidence Limits on Phylogenies: An
Approach Using the Bootstrap. Evolution, 39:783-791.
Gauch, H.G. Jr., 1982. Multivariate Analysis in Community
Ecology. Cambridge University Press, New York.
Greig-Smith, P., 1983. Quantitative Plant Ecology. University
of California Press, Berkeley.
Hill, M.O., 1973. Reciprocal Averaging: An Eigenvector Method of
Ordination. Journal of Ecology, 61:237-249.
Hill, M.O., & Gauch, H.G. Jr., 1980. Detrended Correspondence
Analysis: An Improved Ordination Technique. Vegetatio, 42:47-58.
Jolicoeur, P., & Mosimann, J.E., 1960. Size and Shape Variation
in the Painted Turtle. A Principal Component Analysis.
Growth, 24:339-354.
Legendre, L., & Legendre, P., 1983. Numerical Ecology. Elsevier
Scientific Publishing Company, New York.
Orloci, L., 1978. Multivariate Analysis in Vegetation Research,
2nd edition. W. Junk, Boston.
Pielou, E.C., 1969. An Introduction to Mathematical Ecology.
Wiley-Interscience, New York.
Pielou, E.C., 1984. The Interpretation of Ecological Data.
Wiley-Interscience, New York.
Prentice, I.C., 1980. Multidimensional Scaling as a Research
Tool in Quaternary Palynology: A Review of Theory and
Methods. Review of Palaeobotany & Palynology, 31:71-104.
Sneath, P.H.A., & Sokal, R.R., 1973. Numerical Taxonomy. W.H.
Freeman & Co., San Francisco.
Spicer, R.A., & Hill, C.R., 1979. Principal Components and
Correspondence Analysis of Quantitative Data from a Jurassic
Plant Bed. Review of Palaeobotany & Palynology, 28:273-299.